Skip to content

fix(pod-group-controller): skip DRA ResourceClaim lookup for terminal pods#1554

Merged
davidLif merged 2 commits into
kai-scheduler:mainfrom
david-gang:dgang/issue-1529-skip-dra-claims-for-terminal-pods
May 18, 2026
Merged

fix(pod-group-controller): skip DRA ResourceClaim lookup for terminal pods#1554
davidLif merged 2 commits into
kai-scheduler:mainfrom
david-gang:dgang/issue-1529-skip-dra-claims-for-terminal-pods

Conversation

@david-gang
Copy link
Copy Markdown
Contributor

Description

GetPodMetadata in the podgroupcontroller called FetchPodResourceClaims unconditionally on every reconcile, including for pods in Succeeded/Failed phases. The DRA driver removes per-pod ResourceClaim objects as soon as a pod reaches a terminal phase (independent of pod deletion), so the lookup always failed with NotFound and produced spurious ERROR logs on every reconcile cycle until the pod object itself was eventually cleaned up — which can take an indefinite amount of time when Job.spec.ttlSecondsAfterFinished is unset.

ERROR  Failed to calculate metadata for pod <ns>/<pod>
{"error": "failed to get resource claim <ns>/<claim> for pod <ns>/<pod>: ResourceClaim.resource.k8s.io \"<claim>\" not found"}

The fix adds a phase guard at the top of GetPodMetadata. Terminal pods skip the ResourceClaim lookup and return empty RequestedResources/AllocatedResources — which is what the existing isActivePod/isAllocatedPod guards downstream would have produced anyway.

This is the same root cause as #1455 (fixed scheduler-side in #1456), but the podgroupcontroller metadata path was missed at the time.

Related Issues

Fixes #1529

Checklist

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

None.

Additional Notes

Tests

  • TestIsTerminalPod — table-driven check across the four phases, matching the style of the existing TestIsActivePod/TestIsPodAllocated.
  • TestGetPodMetadata_TerminalPodSkipsResourceClaimLookup — integration-style test with a fake.ClientBuilder and a Succeeded/Failed pod referencing a missing ResourceClaim. Verified to fail (nil-deref panic from FetchPodResourceClaims) without the fix and pass with it.

Validation

  • go test ./pkg/podgroupcontroller/...
  • go vet ./pkg/podgroupcontroller/...
  • gofmt -l pkg/podgroupcontroller/controllers/metadata/ clean

… pods

GetPodMetadata fetched per-pod ResourceClaims unconditionally, even for
pods in Succeeded/Failed phases. The DRA driver removes those claims as
soon as the pod reaches a terminal phase, so the lookup always failed
with NotFound and produced spurious ERROR logs on every reconcile until
the pod was finally deleted.

Add an early return for terminal pods, mirroring the scheduler-side fix
in kai-scheduler#1456.

Fixes kai-scheduler#1529

Signed-off-by: David Gang <dgang@lightricks.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 6, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c2d57530-24cf-449d-9fe2-b19bd9475621

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@david-gang
Copy link
Copy Markdown
Contributor Author

The failing E2E Upgrade Tests job is unrelated to this PR.

  • Helm upgrade itself succeeds (REVISION 2, status: deployed). The timeout happens afterward, waiting for KAIConfig.Status.Available=true at test/e2e/modules/wait/kai_config.go:41.
  • This PR only touches pkg/podgroupcontroller/controllers/metadata/pod.go (a phase guard in GetPodMetadata) — no path to KAIConfig reconciliation or operator startup.
  • An in-flight chart fix ("fix(chart): stop recreating kai-config CR on helm upgrade") is currently iterating on this exact area on main, indicating the upgrade flow is broken independently of this change.
  • All other jobs passed: unit tests, validate, build, regular E2E, FOSSA, coverage.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata 23.53% (+6.86%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata/pod.go 25.00% (+7.22%) 48 (+3) 12 (+4) 36 (-1) 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata/pod_test.go

enoodle
enoodle previously approved these changes May 7, 2026
davidLif
davidLif previously approved these changes May 18, 2026
@davidLif davidLif dismissed stale reviews from enoodle and themself via 1eef996 May 18, 2026 07:25
@enoodle enoodle enabled auto-merge May 18, 2026 07:26
@davidLif davidLif disabled auto-merge May 18, 2026 07:26
@davidLif davidLif enabled auto-merge May 18, 2026 07:26
@github-actions
Copy link
Copy Markdown

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata 23.53% (+6.86%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata/pod.go 25.00% (+7.22%) 48 (+3) 12 (+4) 36 (-1) 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata/pod_test.go

@davidLif davidLif added this pull request to the merge queue May 18, 2026
Merged via the queue into kai-scheduler:main with commit 131a693 May 18, 2026
11 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(podgroupcontroller): logs errors for completed/failed pods when ResourceClaims are already deleted

3 participants